YOLOv5-LiNet: A lightweight network for fruits instance segmentation

To meet the goals of computer vision-based image understanding in agriculture for improved fruit production, a recognition model is expected to be robust against complex and changeable environments, fast, accurate and lightweight enough for deployment on low-power computing platforms. For this reason, a lightweight YOLOv5-LiNet model for fruit instance segmentation, intended to strengthen fruit detection, was proposed based on a modified YOLOv5n. The model uses Stem, Shuffle_Block, ResNet and SPPF as the backbone network, PANet as the neck network, and the EIoU loss function to enhance detection performance. YOLOv5-LiNet was compared with the lightweight models YOLOv5n, YOLOv5-GhostNet, YOLOv5-MobileNetv3, YOLOv5-LiNetBiFPN, YOLOv5-LiNetC, YOLOv5-LiNetFPN, YOLOv5-Efficientlite, YOLOv4-tiny and YOLOv5-ShuffleNetv2, as well as with Mask-RCNN. The results show that YOLOv5-LiNet, with a box accuracy of 0.893, an instance segmentation accuracy of 0.885, a weight size of 3.0 MB and a real-time detection time of 2.6 ms, outperformed the other lightweight models when these factors are taken together. Therefore, the YOLOv5-LiNet model is robust, accurate, fast, applicable to low-power computing devices and extendable to other agricultural products for instance segmentation.


Introduction
The agricultural sector is a major driver of any economy and has to cope with increasing food consumption as the population grows. Fruit is an important agricultural product that is not exempt from these consumer demands. The annual worldwide production of some fruits reported for the year 2020 is estimated at over 841 million metric tons according to Shahbandeh [1]. An automatic recognition system comprising computer vision and a personal computer (PC) was introduced to agriculture to improve fruit production. For example, the visual detection techniques used in horticulture research to understand fruit-related phenotypic traits, such as number, size, shape and color, have replaced the traditional way of monitoring fruit phenotypes, which is destructive and time-consuming. The computer vision component captures fruit images, while the PC, with an integrated deep learning recognition model, recognizes and locates the target fruits in an image. Taking a harvesting robot as a case study, the detection results obtained from the recognition model serve as a guide for a manipulator to pick or harvest the fruits. However, the recognition model, whether for fruit detection or instance segmentation, is hindered by complex and changeable environments. To meet the goals of vision-based image understanding, a robust recognition model is expected to be fast, accurate and lightweight enough for deployment on low-power computing platforms. This paper proposes a lightweight YOLOv5-LiNet model for fruit instance segmentation based on YOLOv5n to address these shortcomings. The contributions are as follows:
1. A robust cucurbit fruits image dataset with bounding polygon annotation was produced for comparative experiments towards instance segmentation accomplishment.
2. Replace the first layer of the backbone with Stem network to effectively improve the feature expression capability without adding too much computational cost.
3. Incorporate the ShuffleNetv2 network to mix the extracted features, reduce the computational cost and parameters while maintaining accuracy with an improved speedup.
4. Introduce the ResNet network to improve the efficiency of deep neural networks while minimizing degradation.
5. Apply the EIoU loss function to bring significant and consistent improvements to detection performance.
The rest of this paper is as follows: Section 2 focuses on the work related to fruit detection and instance segmentation. Section 3 describes the details of dataset, proposed model and experiment. Section 4 provides the compared results and discussion of models, and Section 5 concludes.

Related work
The computer-based recognition model produced by deep learning with convolutional neural networks (CNN) has attained state-of-the-art accuracy, sometimes exceeding human-level performance, in image classification [2], object detection [3,4] and instance segmentation [5]. Object detection can simultaneously classify and localize each target using a bounding box, and is capable of dealing with multi-class scenarios. With this, deep learning with computer vision has significantly improved fruit production through fruit detection for yield prediction, yield estimation, harvesting robot platforms, fruit quality detection, ripeness identification, etc., according to Koirala et al. [6], Koirala et al. [7] and Lawal [8,9]. Nevertheless, fruit detection relies on rectangular bounding boxes and cannot accurately estimate the area or perimeter of a target from an image; for this reason, instance segmentation was introduced to consolidate object detection. The instance segmentation technique is more granular: it characterizes every pixel of a given object and thereby determines the target shape.
Mask-RCNN, proposed by He et al. [5], is a two-stage deep learning detector commonly used for instance segmentation. RiceNet, based on an improved Mask-RCNN, was introduced by Shang et al. [10] for adhesive rice grain segmentation. RiceNet, with few structural parameters, recorded an accuracy and recall rate of 89.5% and 92.6% respectively, but it has a single target category and was not tested on fruits. Liu et al. [11] reported an accuracy of 89.47% and a detection time of 346.1 milliseconds (ms) with an improved Mask-RCNN for cucumber instance segmentation. However, the speed is slow and the robustness is questionable because only one category was specified. Yu et al. [12] demonstrated improved universality and robustness using Mask-RCNN to detect ripe and unripe strawberries, but also with a slow detection speed. The convolutional encoder-decoder network proposed by Ilyas et al. [13] used an adaptive receptive field, a channel selection module and a bottleneck module to accurately recognize strawberry fruit maturity and diseased fruit, but the model could not segment a single target. The optimized Mask-RCNN applied by Jia et al. [14] to persimmon instance segmentation achieved a mean average precision (mAP) and mean average recall (mAR) of 76.3% and 81.1%, respectively. That model is said to be a lightweight network using MobileNetv3 [15] as backbone, but its detection speed was not tested to ascertain performance, and its accuracy requires further improvement. A significant improvement of Mask-RCNN for the segmentation of fruit and vegetables was reported by Hameed et al. [16]. However, the experiment may be limited in cases where the supermarket environment differs from the natural environment. Interestingly, most of the research conducted on fruit instance segmentation applied Mask-RCNN, whose model weight tends to be large and whose detection speed is slow.
Little or no recent literature has reported the use of a single-stage detector for fruit instance segmentation. According to Koirala et al. [7], a single-stage detector is faster than a two-stage detector, and Lawal [8] attributes fast detection to the lightweight size of the model.
A lightweight single-stage detector, DaSNet-v2, was tested by Kang and Chen [17]. It combined fruit detection and instance segmentation with semantic segmentation of branches in a single network architecture to achieve accurate recognition of fruits in a complex orchard environment. It demonstrated a weight size of 8.1 MB with an inference time of 55 ms, but still needs further improvement. The use of bounding polygons for instance segmentation was first developed by Hurtik et al. [18] under the name Poly-YOLO. It generates a number of flexible points for the bounding polygons of an object, which allows the network to be trained for general object shapes, and it optimizes the conventional hyper-column to attain a lower loss by modifying the fusion of feature maps. Following this trend, Mirror-YOLO was proposed by Li et al. [19] for the instance segmentation and detection of mirrors. Mirror-YOLO achieved better performance than other existing mirror detection techniques. The motivation behind this work led to the recent introduction of YOLOv5 segmentation by Jocher et al. [20]. YOLOv5 has been shown to be outstanding on its detection platform, particularly in lightweight size and speed, yet it has not been investigated for fruit instance segmentation. Therefore, it is necessary to develop and evaluate a lightweight fruit instance segmentation model using the YOLOv5 framework with special attention to accuracy and speed. Meanwhile, realizing a lightweight network depends on applying a comparatively simple network structure such as MobileNet (MobileNetv1 [21]; MobileNetv2 [22]; MobileNetv3 [15]), SqueezeNet (SqueezeNet [23]; SqueezeNext [24]), ShuffleNet (ShuffleNetv1 [25]; ShuffleNetv2 [26]), YOLO-tiny (YOLOv3-tiny [27]; YOLOv4-tiny [28]; YOLOv5n [20]) and so on. A computer vision system aiming at accurate localization and segmentation has a vital role in various agricultural applications [29,30].

Dataset
With special consideration given to environmental factors such as reflection, shadows, low-light cloudy and high-light conditions, the images of the cucurbit fruit dataset used in this paper were obtained from the Wanghaizhuang greenhouses, Jinzhong, Shanxi, China, which are open to the public. Cucurbits are a family of fruit plants that share similarity in ground development and have high genetic diversity in shape, color and size, which makes the intelligent perception and acquisition of their information very difficult for fruit instance segmentation. Nevertheless, they are a good source of many nutrients for the human body. For this work, the classes of cucurbit images captured are bitter-melon, cucumber, muskmelon and melon-boyang. The images were taken using a regular digital camera at 3968×2976 pixels in the morning, at midday and in the afternoon. A total of 2469 images were captured (665 of bitter-melon, 664 of cucumber, 404 of muskmelon and 736 of melon-boyang), covering complex conditions: leaf occlusion, superimposed fruit, dense targets, branch occlusion, backlight, front light and other fruit scenes. The collected images were stored in JPG format and randomly divided into 80% train-set, 15% valid-set and 5% test-set. Afterwards, the ground truth bounding polygon of each target in an image was labeled by hand using the Labelme [31] annotation tool. The presumed shape of each target was drawn regardless of the complex and changeable conditions in the image, and the annotation files were saved in COCO format. The COCO annotation was then converted into poly-YOLO format. For instance segmentation, this format gives the object class number first, followed by x1 y1 to xn yn, where xi yi are the coordinates of the i-th of the n polygon points of the mask.
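The label format described above can be sketched as follows. This is a minimal illustration, not the conversion script used in the paper: the function name `coco_polygon_to_label` is hypothetical, the class number passed in is arbitrary, and the normalization of coordinates by image width and height is an assumption taken from the usual YOLOv5 segmentation label convention rather than from the text.

```python
def coco_polygon_to_label(class_id, polygon, img_w, img_h):
    """Turn one flat COCO polygon [x1, y1, ..., xn, yn] (pixels) into a
    single label line: class number followed by the polygon points.

    Assumption: coordinates are normalized to [0, 1] by image size.
    """
    if len(polygon) < 6 or len(polygon) % 2 != 0:
        raise ValueError("a polygon needs at least 3 (x, y) points")
    coords = []
    for i in range(0, len(polygon), 2):
        # Clamp so points drawn slightly outside the image stay valid.
        x = min(max(polygon[i] / img_w, 0.0), 1.0)
        y = min(max(polygon[i + 1] / img_h, 0.0), 1.0)
        coords.append(f"{x:.6f} {y:.6f}")
    return f"{class_id} " + " ".join(coords)


# Hypothetical triangle on a 3968x2976 image, with class number 1.
line = coco_polygon_to_label(1, [0, 0, 1984, 0, 1984, 1488], 3968, 2976)
print(line)
```

Each target in an image would contribute one such line to the label file for that image.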

YOLOv5-LiNet
The lightweight YOLOv5-LiNet is designed based on the original YOLOv5n architecture for fruit instance segmentation. It combines backbone, neck and head networks with a depth multiple of 0.33 and a width multiple of 0.25. Generally, the backbone network aggregates and forms image features at different granularities. The LiNet backbone of YOLOv5-LiNet shown in Fig 1 comprises Stem [32], ResNet [33], Shuffle_Block [26] and SPPF [20]. The Stem structure in [...] Pointwise group convolution used a single convolutional filter per input channel, while the channel shuffle enables information exchange between the two concatenated network branches to improve performance. The feature maps extracted after the ResNet network were passed to the SPPF in Fig 2. The SPPF is an improvement on spatial pyramid pooling (SPP) [35] and consists of a Conv layer with BN layer and SiLU activation, and a Maxpool layer. It is a feature enhancement network that extracts the major information of the feature map and performs stitching to reduce the loss of target detection. SPPF is faster and requires fewer giga floating-point operations (GFLOPs) than SPP according to Jocher et al. [20]. The stated components of the YOLOv5-LiNet backbone were chosen to reduce parameters and GFLOPs, with the aim of improving detection accuracy and speed with a smaller model weight size.
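The channel shuffle mentioned above can be illustrated with a small sketch. It is written in plain Python on a list of channel labels rather than on tensors, and the function name `channel_shuffle` and the two-group example are assumptions for illustration only; it follows the reshape, transpose and flatten pattern described for ShuffleNet [25,26] and is not the code of the Shuffle_Block used in the paper.

```python
def channel_shuffle(channels, groups):
    """Interleave channels from `groups` equally sized groups.

    Equivalent to viewing the channel axis as (groups, per_group),
    transposing to (per_group, groups) and flattening again.
    """
    n = len(channels)
    if n % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


# Two concatenated branches, A and B, with three channels each.
concatenated = ["a0", "a1", "a2", "b0", "b1", "b2"]
print(channel_shuffle(concatenated, groups=2))
# ['a0', 'b0', 'a1', 'b1', 'a2', 'b2']
```

After the shuffle, neighbouring channels come from different branches, so a following convolution that processes channels in groups sees information from both.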
The neck network is an important part of a fruit instance segmentation model; it is used to build feature pyramids and to extract multi-scale features in the target detection process. Fig 1 shows the different types of neck network used in this paper for the ablation study. The path aggregation network (PANet) [36] was used as the neck of YOLOv5-LiNet in order to promote and maintain a balance between accuracy and speed. PANet enables a model that generalizes well across object scales by incorporating the C3 network shown in Fig 2, and it enhances multi-scale fusion. This is to improve detection accuracy. A bottleneck that stacks two Conv layers, 1×1 and 3×3, with skip connections was embedded into the C3 network after the 1×1 Conv layer of the second branch and later concatenated with the 1×1 Conv layer of the first branch, followed by a 1×1 Conv layer to improve detection performance. Each Conv layer is associated with a BN layer and SiLU activation. Similarly, the feature maps from the backbone network were forwarded to the neck networks in Fig 1 for convolution and up-sampling to double the image dimensions for concatenation. The concatenated information passes to the C3 network for output detection. This process is repeated until the small, medium and large levels are produced. The head network is the final output of detection. It outputs both fruit detection and instance segmentation at small, medium and large scales, consuming features from the neck. It applies bounding polygons (anchors) to the mapped features together with the probability of the fruit target class, score and position, and uses non-maximum suppression (NMS) to select the appropriate fruit target and remove redundant information. To measure the quality of model prediction and show the gap between predicted and actual values, the efficient intersection-over-union (EIoU) loss function (see Zhang et al. [37] for details) was applied to the lightweight YOLOv5-LiNet instead of the commonly used CIoU [38].
EIoU directly measures the overlap area, central point and side lengths of the target and the anchor, which benefits convergence speed and localization accuracy.
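The three terms of EIoU can be made concrete with a short sketch for axis-aligned boxes. It follows the formulation of Zhang et al. [37], in which the loss is 1 − IoU plus a centre-distance penalty and separate width and height penalties, each normalized by the smallest enclosing box. The function name `eiou_loss`, the `(x1, y1, x2, y2)` box layout and the `eps` constant are assumptions for illustration; this is not the training code used in the paper.

```python
def eiou_loss(pred, target, eps=1e-9):
    """EIoU loss for two boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Overlap term: plain IoU.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = pw * ph + tw * th - inter + eps
    iou = inter / union

    # Width and height of the smallest box enclosing both boxes.
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)

    # Central-point term: squared centre distance over squared diagonal.
    dx = (px1 + px2) / 2 - (tx1 + tx2) / 2
    dy = (py1 + py2) / 2 - (ty1 + ty2) / 2
    centre = (dx * dx + dy * dy) / (cw * cw + ch * ch + eps)

    # Side-length terms: width and height differences, penalized separately.
    width = (pw - tw) ** 2 / (cw * cw + eps)
    height = (ph - th) ** 2 / (ch * ch + eps)

    return 1.0 - iou + centre + width + height


print(eiou_loss((0, 0, 10, 10), (0, 0, 10, 10)))    # close to 0
print(eiou_loss((0, 0, 10, 10), (20, 20, 30, 30)))  # greater than 1
```

Unlike CIoU, which penalizes the aspect ratio, this form penalizes the width and height differences directly, which is the property the text refers to as measuring side length.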

Experiment
This experiment uses Python 3.19.13 and the torch-1.11.0+cu113 deep learning framework for model training and testing on a computer with an Intel Core i7-12700 CPU @ 64-bit 4.90 GHz, 32 GB RAM, an NVIDIA GeForce RTX 3060 GPU with 12045 MiB of memory and the Ubuntu 22.04 LTS operating system. Table 1 provides the details of all the trained models. Following the general procedure for network training on the YOLOv5 platform, the proposed lightweight YOLOv5-LiNet and the other YOLO-related models take an input of 512×512 pixels, with a batch size of 16, momentum of 0.937, weight decay of 0.0005, IoU of 0.2, hue of 0.015, saturation of 0.7, lightness of 0.4, mosaic of 1.0 and 300 training epochs, while Mask-RCNN received an input of 512×512 pixels with default parameters on the MMDetection platform. Random initialization was used to initialize the weights, and all the models were trained from scratch.

Evaluation
This paper used Precision, Recall, F1-score and mean Average Precision (mAP) as the evaluation metrics, set at an IoU threshold of 0.5. A predicted bounding polygon is correct (true positive, TP) if its overlap with a labeled bounding polygon exceeds the IoU threshold; otherwise the predicted bounding polygon is considered a false positive (FP). Likewise, a false negative (FN) is counted when a labeled bounding polygon has an IoU with a predicted bounding polygon lower than the threshold value. Precision is the ratio of correctly detected fruits to the total number of detected fruits. Recall is the ratio of correctly detected fruits to the total number of fruits in the dataset. F1-score is the trade-off between Precision and Recall that shows the model performance, and mAP is the overall performance under different confidence thresholds [8]. The metrics can be defined as below:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (2)$$
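As a worked illustration of the definitions in this section, the sketch below computes Precision, Recall and F1-score from counts of true positives, false positives and false negatives. The function name `detection_metrics` and the counts in the example are hypothetical and are not results from the paper; the F1-score is taken as the usual harmonic mean of Precision and Recall, which is an assumption about the exact form used.

```python
def detection_metrics(tp, fp, fn):
    """Return (precision, recall, f1) from detection counts."""
    # Precision: correct detections over all detections.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: correct detections over all labeled targets.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1-score: harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts: 90 correct detections, 10 false alarms, 30 misses.
p, r, f1 = detection_metrics(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```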

Results and discussion
After network training, the validation losses obtained for box and segmentation are presented in Figs 3 and 4 respectively. Validation loss is reported because it measures how well the model fits, or predicts on, the valid set (new data). It was observed that the segmentation loss of all models in Fig 4 is lower than the box loss in Fig 3, which is attributed to the bounding polygons providing the actual shape of the target. At the same time, the loss variation between models in Fig 4 is smaller than in Fig 3. This illustrates the difference between using bounding boxes and polygons. The calculated F1-scores for box and segmentation are displayed in Figs 5 and 6 respectively. Figs 5 and 6 show that the F1-score of the proposed YOLOv5-LiNet is higher than that of the other models, with YOLOv5-ShuffleNetv2 displaying the lowest F1-score. The mAP is more accurate than the F1-score because it measures the global relationship between Precision and Recall. Figs 7 and 8 show the mAP of box and segmentation respectively. As with the F1-score, YOLOv5-LiNet outperformed the other models in mAP. However, the F1-score and mAP values for box are higher than those for segmentation. This is a result of the complexity of the polygon points of segmentation compared to the rectangular points of the box.
The lightweight models were evaluated on the test-set using four batches of images, and the findings are shown in Figs 9-18. A number of target fruits were detected and instantly segmented in the tested images without missed detection, showing robustness under various conditions. This demonstrates the effectiveness of the SPPF added into the models. Nevertheless, the level of detection accuracy varies from one model to another. YOLOv5-LiNet with EIoU loss is more accurate than YOLOv5-LiNetC with the CIoU loss function, despite having the same network structure. This indicates that EIoU is better than CIoU as a loss function, although this requires more investigation. The summary performance of the compared models is shown in Table 2. Detection speed and accuracy are the main factors used to examine performance. Model weight and speed are assessed by the number of layers for network topology, GFLOPs for network speed and size for network weight, while accuracy is based on F1-score and mAP. Excluding Mask-RCNN, YOLOv5-GhostNet has the largest number of layers among the models, while YOLOv4-tiny has the fewest. The GFLOPs results correspond to the weight size derived from the parameters of a model. The GFLOPs and weight size of Mask-RCNN are very large compared to the YOLO-related models. In other words, a single-stage detector is far lighter than a two-stage detector. Among the YOLO-related models, the weight size is ranked as YOLOv5n greater than YOLOv5-GhostNet, YOLOv5-MobileNetv3, YOLOv5-LiNetBiFPN, YOLOv5-LiNetC, YOLOv5-LiNet, YOLOv5-LiNetFPN, YOLOv5-Efficientlite, YOLOv4-tiny and YOLOv5-ShuffleNetv2. This variation in weight size influences the tested real-time detection of the models, which supports the claim of Lawal [8]. Apart from Mask-RCNN, which was unable to meet the real-time detection standard of less than 50 ms proposed by Zhang et al. [40], all YOLO-related models achieved this standard, as shown in Table 2. YOLOv5-LiNet and YOLOv5-LiNetC, with the same detection time of 2.6 ms, are faster than Mask-RCNN (55.6 ms), YOLOv5n (2.8 ms), YOLOv5-Efficientlite (3.4 ms), YOLOv5-MobileNetv3 (3.5 ms), YOLOv5-GhostNet (2.9 ms) and YOLOv5-LiNetBiFPN (2.9 ms), but slower than YOLOv5-LiNetFPN (2.4 ms), YOLOv5-ShuffleNetv2 (2.4 ms) and YOLOv4-tiny (2.2 ms). Nevertheless, the detection time of YOLOv5-LiNet is close to that of YOLOv5-LiNetFPN, YOLOv5-ShuffleNetv2 and YOLOv4-tiny. Considering detection accuracy together with detection time completes the assessment of model performance. With reference to mAP, the accuracy results in Table 2 are consistent with the results displayed in Figs 7 and 8. For the mAP of box, the 0.893 of YOLOv5-LiNet is 0.2%, 0.3%, 1.1%, 1.5%, 2.3%, 3.2%, 5.0%, 7.1%, 7.7% and 8.0% higher than YOLOv5-LiNetC, YOLOv5n, YOLOv5-LiNetBiFPN, YOLOv5-GhostNet, YOLOv5-LiNetFPN, YOLOv5-Efficientlite, YOLOv5-MobileNetv3, Mask-RCNN, YOLOv4-tiny and YOLOv5-ShuffleNetv2 respectively. For the mAP of instance segmentation, the 0.885 of YOLOv5-LiNet is 0.5%, 1.0%, 1.3%, 2.2%, 2.6%, 3.3%, 5.6%, 5.8%, 7.2% and 7.5% higher than YOLOv5-LiNetC, YOLOv5n, YOLOv5-LiNetBiFPN, YOLOv5-LiNetFPN, YOLOv5-GhostNet, YOLOv5-Efficientlite, YOLOv5-MobileNetv3, Mask-RCNN, YOLOv4-tiny and YOLOv5-ShuffleNetv2 respectively. Owing to the outstanding performance of YOLOv5-LiNet compared to the other models, the ablation study using different neck networks and loss functions shows that PANet > BiFPN > FPN and EIoU > CIoU respectively. The recorded mAP of instance segmentation for YOLOv5-LiNet is 0.5% higher than that of YOLOv5-LiNetC through the use of EIoU loss, 1.3% higher than that of YOLOv5-LiNetBiFPN through the use of PANet, and 2.2% higher than that of YOLOv5-LiNetFPN through the use of PANet. Additionally, YOLOv5-LiNet performs better in terms of lightweight size than the models proposed by Kang and Chen [17], Hurtik et al. [18] and Li et al. [19], and better in accuracy and speed than the state-of-the-art YOLOv5n and Mask-RCNN. For this reason, the YOLOv5-LiNet model is robust against the complex environment, accurate, fast, and applicable to low-power computing devices embedded with computer vision.
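The real-time criterion discussed above is simple arithmetic and can be checked with a short sketch. The 50 ms limit and the detection times of 2.6 ms and 55.6 ms are taken from the text; the function names and the assumption that each detection time refers to a single image are illustrative only.

```python
REAL_TIME_LIMIT_MS = 50.0  # real-time standard cited from Zhang et al. [40]


def frames_per_second(latency_ms):
    """Convert a per-image detection time in milliseconds to frames per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


def is_real_time(latency_ms, limit_ms=REAL_TIME_LIMIT_MS):
    """True if the detection time is below the real-time limit."""
    return latency_ms < limit_ms


for name, ms in [("YOLOv5-LiNet", 2.6), ("Mask-RCNN", 55.6)]:
    print(f"{name}: {frames_per_second(ms):.1f} FPS, real-time={is_real_time(ms)}")
```

Under these assumptions, 2.6 ms corresponds to roughly 385 frames per second, while 55.6 ms falls just outside the 50 ms limit.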

Conclusion
A lightweight YOLOv5-LiNet model for fruit instance segmentation, based on a modified YOLOv5n, has been proposed in this paper to consolidate fruit detection for improved fruit production. The model comprises Stem, Shuffle_Block, ResNet and SPPF as the backbone network, PANet as the neck network, and the EIoU loss function to improve detection performance. At the same time, a robust cucurbit fruits image dataset with bounding polygon annotation was produced for comparative experiments on the proposed model. The ablation study carried out on YOLOv5-LiNet shows that performance ranks as PANet > BiFPN > FPN for the neck network and EIoU > CIoU for the loss function. YOLOv5-LiNet was compared with the lightweight models YOLOv5n, YOLOv5-GhostNet, YOLOv5-MobileNetv3, YOLOv5-LiNetBiFPN, YOLOv5-LiNetC, YOLOv5-LiNetFPN, YOLOv5-Efficientlite, YOLOv4-tiny and YOLOv5-ShuffleNetv2, as well as with Mask-RCNN. The results demonstrated that YOLOv5-LiNet, with a box mAP of 0.893, an instance segmentation mAP of 0.885, a weight size of 3.0 MB and a detection time of 2.6 ms, is outstanding in performance compared to the other lightweight models when these factors are taken together. Hence, the YOLOv5-LiNet model is highly robust against complex and changeable environments, accurate, promising for better generalization and real-time detection, applicable to low-power computing devices and extendable to other agricultural products for instance segmentation.